Learning Convolutional Text Representations for Visual Question Answering

نویسندگان

Zhengyang Wang

Shuiwang Ji

چکیده

Visual question answering is a recently proposed articial intelligence task that requires a deep understanding of both images and texts. In deep learning, images are typically modeled through convolutional neural networks, and texts are typically modeled through recurrent neural networks. While the requirement for modeling images is similar to traditional computer vision tasks, such as object recognition and image classication, visual question answering raises a dierent need for textual representation as compared to other natural language processing tasks. In this work, we perform a detailed analysis on natural language questions in visual question answering. Based on the analysis, we propose to rely on convolutional neural networks for learning textual representations. By exploring the various properties of convolutional neural networks specialized for text data, such as width and depth, we present our “CNN Inception + Gate” model. We show that our model improves question representations and thus the overall accuracy of visual question answering models. We also show that the text representation requirement in visual question answering is more complicated and comprehensive than that in conventional natural language processing tasks, making it a beer task to evaluate textual representation methods. Shallow models like fastText, which can obtain comparable results with deep learning models in tasks like text classication, are not suitable in visual question answering.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Evaluating Multimodal Representations on Sentence Similarity: vSTS, Visual Semantic Textual Similarity Dataset

The success of word representations (embeddings) learned from text has motivated analogous methods to learn representations of longer sequences of text such as sentences, a fundamental step on any task requiring some level of text understanding [13]. Sentence representation is a challenging task that has to consider aspects such as compositionality, phrase similarity, negation, etc. In order to...

متن کامل

Open-Ended Visual Question-Answering

This thesis studies methods to solve Visual Question-Answering (VQA) tasks with a Deep Learning framework. As a preliminary step, we explore Long Short-Term Memory (LSTM) networks used in Natural Language Processing (NLP) to tackle Question-Answering (text based). We then modify the previous model to accept an image as an input in addition to the question. For this purpose, we explore the VGG-1...

متن کامل

Learning Document Image Features With SqueezeNet Convolutional Neural Network

The classification of various document images is considered an important step towards building a modern digital library or office automation system. Convolutional Neural Network (CNN) classifiers trained with backpropagation are considered to be the current state of the art model for this task. However, there are two major drawbacks for these classifiers: the huge computational power demand for...

متن کامل

Convolutional Encoding in Bidirectional Attention Flow for Question Answering

Deep learning systems for complex natural language processing tasks like question answering are often large, cumbersome models that require excessive computational power and time. We seek to address this issue by exploring efficient and parallelizable alternatives to the more computationally expensive components of one of the top-performing question-answering architectures. In particular, we ex...

متن کامل

ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering

We propose a novel attention based deep learning architecture for visual question answering task (VQA). Given an image and an image-related question, VQA returns a natural language answer. Since different questions inquire about the attributes of different image regions, generating correct answers requires the model to have questionguided attention, i.e., the attention on the regions correspond...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

CoRR

دوره abs/1705.06824 شماره

صفحات -

تاریخ انتشار 2017

Learning Convolutional Text Representations for Visual Question Answering

نویسندگان

چکیده

منابع مشابه

Evaluating Multimodal Representations on Sentence Similarity: vSTS, Visual Semantic Textual Similarity Dataset

Open-Ended Visual Question-Answering

Learning Document Image Features With SqueezeNet Convolutional Neural Network

Convolutional Encoding in Bidirectional Attention Flow for Question Answering

ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering

عنوان ژورنال:

اشتراک گذاری